The New York Bike Load dataset contains bike load information for four bridges (Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge), along with the date, day, high temperature, low temperature, and precipitation.

In my frequent itemset analysis, I would like to examine the relationship between temperature, precipitation, and the bike load on the Manhattan Bridge. I would like to know what, in most cases, goes along with an increase or decrease of the bike load on the bridge. For example, when the temperature and the precipitation both go up, does the bike load go up or down in most cases?

To examine these relationships, we need to transform the data so that each row represents the change between two consecutive days. If the temperature goes up, we write 1; if it goes down, we write -1; if it does not change, we write 0. The same encoding is applied to precipitation and bike load. After that, we can use the PyMining package to determine which combinations of -1, 0, and 1 occur most frequently.
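
As a minimal sketch of this encoding, the snippet below turns a short series of daily values into change indicators. The helper name `change_indicator` and the sample temperatures are made up for illustration and are not part of the dataset.

```python
def change_indicator(previous, current):
    """Return 1 if the value rose, -1 if it fell, 0 if it stayed the same."""
    if current > previous:
        return 1
    if current < previous:
        return -1
    return 0

# Hypothetical low temperatures for five consecutive days (not real data).
sample_temps = [50.0, 53.1, 53.1, 48.9, 55.0]

# Compare each day with the following day, giving one indicator per pair.
indicators = [change_indicator(a, b) for a, b in zip(sample_temps, sample_temps[1:])]
print(indicators)  # [1, 0, -1, 1]
```

Five days give four indicators, which is why the transformed table in Step 2 has one row fewer than the original.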

Step 1: Get the data needed from the database


In [7]:
import sqlite3

from pandas import DataFrame

# Connect to the SQLite database that holds the bike counts.
conn = sqlite3.connect('bicycle.db')
c = conn.cursor()

# Fetch low temperature, precipitation, and the Manhattan Bridge count for each day.
c.execute('SELECT LoTemp, Precip, Manhattan FROM bicycle')
data = c.fetchall()

Step 2: Transform the Data

In this step, I create a second dataframe and use a for loop to fill it in based on the original dataframe.


In [8]:
# Original daily values, one row per day.
Data = DataFrame(data, columns=['LoTemp', 'Precip', 'Manhattan'])
n = len(Data)

# One row per pair of consecutive days, so n-1 rows in total.
newData = DataFrame(index=range(0, n-1), columns=['LoTemp', 'Precip', 'Manhattan'])

In [9]:
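for x in range(0, n-1):
    for y in ['LoTemp', 'Precip', 'Manhattan']:
        # Compare day x with the following day x+1.
        if Data.loc[x, y] < Data.loc[x+1, y]:
            newData.loc[x, y] = 1    # value went up
        elif Data.loc[x, y] > Data.loc[x+1, y]:
            newData.loc[x, y] = -1   # value went down
        else:
            newData.loc[x, y] = 0    # otherwise treated as no change
# Each row becomes a sequence [LoTemp, Precip, Manhattan] for PyMining.
datalist = newData.values.tolist()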

Step 3: PyMining

In my data, each column represents a different attribute, so the order of the values within each row matters. I decided to use the sequence mining method.


In [19]:
from pymining import seqmining

# Enumerate frequent sequences using a minimum support of 20.
freq_seqs = seqmining.freq_seq_enum(datalist, 20)
sorted(freq_seqs)


Out[19]:
[((-1,), 162),
 ((-1, -1), 68),
 ((-1, -1, 1), 22),
 ((-1, 0), 32),
 ((-1, 1), 79),
 ((0,), 100),
 ((0, -1), 57),
 ((0, 1), 49),
 ((1,), 169),
 ((1, -1), 88),
 ((1, 0), 51),
 ((1, 0, -1), 28),
 ((1, 0, 1), 23),
 ((1, 1), 70),
 ((1, 1, -1), 23)]
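
To make the counts above easier to read, here is a small sketch of how I understand the support of a pattern to be computed: a row supports a pattern if the pattern's items appear in the row in the same order, with gaps allowed. This is my reading of how `freq_seq_enum` counts, not code taken from PyMining; the helper names and the toy rows are hypothetical.

```python
def contains_subsequence(row, pattern):
    """True if the items of pattern occur in row in order (gaps allowed)."""
    remaining = iter(row)
    # `item in remaining` advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)

def support(rows, pattern):
    """Number of rows that contain pattern as a subsequence."""
    return sum(contains_subsequence(row, pattern) for row in rows)

# Toy rows in the order [LoTemp, Precip, Manhattan]; not real data.
toy_rows = [[1, 0, -1], [1, 0, 1], [-1, 1, -1], [1, 0, -1]]

print(support(toy_rows, (1, 0, -1)))  # 2
print(support(toy_rows, (1, -1)))     # 3
print(support(toy_rows, (-1,)))       # 3
```

Under this reading, a length-2 pattern such as (1, -1) can match different pairs of columns, so only the length-3 patterns map unambiguously onto (temperature, precipitation, bike load).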

In [ ]:
conn.close()

Looking at the frequent sequences of length 3, the most frequent one is (1, 0, -1), with a count of 28. That means that when the temperature goes up and the precipitation doesn't change, the bike load goes down. One thing to note is that the temperature indicator is rarely 0, since even a small change in temperature registers as an increase or a decrease. The same is true of precipitation on days when it actually rains. So I believe that in most instances where the precipitation indicator is 0, the precipitation was 0 on both of the consecutive days. The high frequency of (1, 0, -1) therefore actually means that when the temperature goes up and there is no precipitation, the bike load most often goes down.

The second most frequent length-3 sequence is (1, 0, 1), with a count of 23 (tied with (1, 1, -1)), which means that when the temperature increases and there is no precipitation, the bike load goes up. These two opposite findings differ by a count of only 5. Looking at the data again, I found that I had ignored one important factor: the date. The bike count data was collected from April to October. So it makes sense that during the colder months people prefer to bike when it is relatively warmer, and during the hotter months people prefer to bike when it is relatively cooler.
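
One way to check the seasonal explanation would be to count the full indicator triples separately for cooler and warmer months. The sketch below assumes each row has been tagged with its month, which the query above does not fetch; the `rows_with_month` values, the helper name, and the choice of June to August as the hot season are all hypothetical.

```python
from collections import Counter

def triple_counts_by_season(rows_with_month, hot_months=(6, 7, 8)):
    """Count (temp, precip, load) triples separately for hot and cool months."""
    counts = {"hot": Counter(), "cool": Counter()}
    for month, triple in rows_with_month:
        season = "hot" if month in hot_months else "cool"
        counts[season][tuple(triple)] += 1
    return counts

# Made-up (month, [LoTemp, Precip, Manhattan]) rows, for illustration only.
rows_with_month = [
    (4, [1, 0, 1]),
    (4, [1, 0, 1]),
    (7, [1, 0, -1]),
    (7, [1, 0, -1]),
    (10, [-1, 1, -1]),
]

counts = triple_counts_by_season(rows_with_month)
print(counts["cool"].most_common(1))  # [((1, 0, 1), 2)]
print(counts["hot"].most_common(1))   # [((1, 0, -1), 2)]
```

If the real data showed (1, 0, 1) concentrated in the cooler months and (1, 0, -1) in the hotter ones, that would support the explanation; the toy rows here are constructed to show the shape of the check, not its result.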

There are two other frequent length-3 sequences: (-1, -1, 1), with a count of 22, and (1, 1, -1), with a count of 23. Both tell the same story about the relationship between precipitation and bike load: bike load decreases when precipitation goes up and increases when precipitation goes down. We can also see that none of the four frequent length-3 sequences pairs an increase in precipitation with a decrease in temperature. That suggests that, in New York, it rains more often during cold seasons.

